2.8 How to aggregate and link datasets
Datasets are usually built up through the command import
, which adds one and one variable measured at specified custom dates. Such variables must have the same unit type, usually persons represented by the unique keyidentifier PERSONID_1. The linking is done automatically through the analysis system, so that the user only has to deal with the import
-command, specifying variable name, measurement date and optional alias.
The analysis system makes it also possible to analyze data measured over other unit types such as event level, municipal level, family level, course level, job level etc. Data with unit types other than persons cannot be imported directly into an individual level dataset (datasets with persons as unit type, given by the key identifier variable PERSONID_1). They must first be processed into the appropriate unit level given by the variable to be used as a link key in the target data set. Only then the datasets can be linked together using the command merge
.
Data at a lower unit level than person, e.g. event or course level1, must be aggregated to person level using the command collapse()
before merging the data into an individual level datasett using merge
. The collapse()
-command does two operations:
-
Aggregates data into a higher unit level given by an identifier variable. In principle, all categorical variables may be used as identifier, e.g. by using municipality as identifier, you can aggregate data into municipality level data
-
Performs a summary calculation across the units for each of the new aggregated units. Measurement method is indicated in parentheses after the command. The following options are available:
-
mean
(mean) -
median
(median) -
min
(minimum) -
max
(maximum) -
count
(number of observations) -
sum
(sum) -
semean
(standard error of mean) -
sebinomial
(binominal standard error of mean) -
sd
(standard deviation) -
percent
(percentage of valid values) -
iqr
(interquartile range) -
gini
(gini coefficient)
-
Data at the same or higher unit level than person, e.g. municipality or family level, however, can be connected to a personal data set using the corresponding variable in the target data set (using it as a link key).
Example on aggregating course data (data on ongoing studies) from course level to person level for merging into a person level dataset:
collapse(max) edulevel, by(personid)
rename edulevel highest_edulevel
merge highest_edulevel into persondataset
Example on aggregating from person level to family level (sums income across all family members within each family and calculate family income), and then merging family income into a person level dataset:
collapse(sum) income, by(familynumber)
rename income familyincome
merge familyincome into persondataset on familynumber
Note that in the example above, the link variable is specified through the expression on familynumber
. This must always be done if you use other link variables than the main identifier of the target dataset.
Tip! It is possible to change the name of the aggregated variables inside the collapse() expression itself. The example above can thus be simplified as follows:
collapse(sum) income -> familyincome, by(familynumber)
merge familyincome into persondataset on familynumber
It is possible to aggregate over several variables in the same operation, and you can use different measurement types for each individual variable. It is also possible to use several measurement types for a single variable. If you only specify one measurement type, this is used for all variables. If you want different ones for each variable, enter the measurement type in brackets in front of each of the variables.
Example: Counts the number of courses and finds courses with the highest education code for each individual. Then connects to a personal data set:
create-dataset coursedata
import db/NUDB_KURS_NUS 2020-11-01 as coursetype
import db/NUDB_KURS_FNR as fnr
destring coursetype
collapse (count) coursetype -> number_courses (max) coursetype -> highest_level, by(fnr)
merge number_courses highest_level into residents
Example: Counts the number of courses, and finds the course with the highest education code as well as the course that started first, for each individual. Then connects to a personal dataset. The variable START@coursetype
is used to find the course that started first. This is additional information that is only available if you use the import-event
command:
//Getting people who are taking higher education during 2020
create-dataset coursedata
import-event db/NUDB_KURS_NUS 2020-01-01 to 2020-12-31 as coursetype
destring coursetype
keep if coursetype >= 700000 & coursetype < 900000
//Connects link between course ID and birth number
create-dataset link_course_person
import db/NUDB_KURS_FNR as fnr
merge fnr into coursedata
//Create statistics (collapse) on the number of events with high education per individual, highest level of education and first course, and connect this to the personal dataset
use coursedata
collapse (count) coursetype -> number_courses (max) coursetype -> highest_level (min) START@coursetype -> first_course, by(fnr)
merge number_courses highest_level first_course into residents
Click on the links below for examples on how to link information on parents, families and courses into your individual level dataset. The latter illustrates the interconnections of data at a lower level than persons (course data are information on ongoing education represented by the relevant course/subject taken at a given time, where people can take several courses simultaneously). Data on parents and families illustrate interconnections of data at a higher level than person.
Use the help collapse
command in the analysis tool for more information on what you can use collapse
for.
Example: How to use parental information in analysis
Example: How to aggregate data into family level
Example: Collect information on ongoing education during a year
Example: Collect information on ongoing education for a given date
Footnotes
-
Course data is slightly different from other personal data as these data even after extraction at a given time contain personal data with several observations per individual. This reflects the fact that it is possible to participate in several different courses / studies at the same time. The same principle applies to job data where it is possible to have several jobs at the same time. ↩